NSF PAR Search | NSF Public Access Repository


Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation

Huang, Ruizhe; Yarmohammadi, Mahsa; Khudanpur, Sanjeev; Povey, Daniel (September 2024, Interspeech)

Existing research suggests that automatic speech recognition (ASR) models can benefit from additional contexts (e.g., contact lists, user-specified vocabulary). Rare words and named entities can be better recognized with contexts. In this work, we propose two simple yet effective techniques to improve context-aware ASR models. First, we inject contexts into the encoders at an early stage instead of merely at their last layers. Second, to encourage the model to leverage the contexts during training, we perturb the reference transcription with alternative spellings so that the model learns to rely on the contexts to make correct predictions. On LibriSpeech, our techniques together reduce the rare-word error rate by 60% and 25% relative to no biasing and shallow fusion, respectively, setting a new state of the art. On SPGISpeech and a real-world dataset, ConEC, our techniques also yield good improvements over the baselines.
Full Text Available
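
The text-perturbation idea in the abstract above can be illustrated with a minimal Python sketch. The abstract does not say how alternative spellings are produced or how often they are applied, so everything here is an assumption made for illustration: the function name `perturb_transcript`, the lookup table `alt_spellings`, the probability `prob`, and the choice to put the perturbed spelling into the context list. It is not the authors' implementation.

```python
import random


def perturb_transcript(words, rare_words, alt_spellings, prob=0.5, rng=None):
    """Swap rare words for alternative spellings (hypothetical helper).

    Returns the perturbed target words and a context list that holds the
    spelling actually used in the target, so that the spelling can only be
    predicted correctly by consulting the context.
    """
    rng = rng or random.Random()
    perturbed = []
    context = []
    for word in words:
        target = word
        if word in rare_words:
            alternatives = alt_spellings.get(word, [])
            # Assumption: perturb each rare word independently with probability `prob`.
            if alternatives and rng.random() < prob:
                target = rng.choice(alternatives)
            if target not in context:
                context.append(target)
        perturbed.append(target)
    return perturbed, context


if __name__ == "__main__":
    words = ["call", "kaitlyn", "now"]
    print(perturb_transcript(words, {"kaitlyn"}, {"kaitlyn": ["caitlin"]}, prob=1.0))
```

With `prob=1.0` every rare word that has a listed alternative is replaced; with `prob=0.0` the transcript is returned unchanged and the context list holds the original spellings.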
ConEC: Earnings Call Dataset with Real-world Contexts for Benchmarking Contextual Speech Recognition

Huang, Ruizhe; Yarmohammadi, Mahsa; Trmal, Jan; Liu, Jing; Raj, Desh; Paola Garcia, Leibny; Ivanov, Alexei V; Ehlen, Patrick; Yu, Mingzhi; Povey, Dan; et al (May 2024, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024))
Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen (Ed.)
Knowing the particular context associated with a conversation can help improve the performance of an automatic speech recognition (ASR) system. For example, if we are provided with a list of in-context words or phrases, such as the speaker’s contacts or recent song playlists, during inference, we can bias the recognition process towards this list. There are many works addressing contextual ASR; however, there are few publicly available real benchmarks for evaluation, making it difficult to compare different solutions. To this end, we provide a corpus (“ConEC”) and baselines to evaluate contextual ASR approaches, grounded on real-world applications. The ConEC corpus is based on public-domain earnings calls (ECs) and associated supplementary materials, such as presentation slides and earnings news releases, as well as a list of meeting participants’ names and affiliations. We demonstrate that such real contexts are noisier than artificially synthesized contexts that contain the ground truth, yet they still leave considerable room for future improvement of contextual ASR technology.
Full Text Available
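
As a rough illustration of what biasing recognition towards a context list can mean, the Python sketch below re-ranks an n-best list by adding a fixed bonus for every context phrase a hypothesis contains. This is a deliberately simple post-hoc rescoring, not the baselines described in the abstract above; the function names, the `bonus` value, and the "higher score is better" convention are all assumptions.

```python
def contains_phrase(tokens, phrase):
    """Return True if the token list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    if n == 0:
        return False
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def rescore_with_context(nbest, context_phrases, bonus=1.0):
    """Re-rank (text, score) hypotheses; higher score is assumed to be better.

    Each context phrase found in a hypothesis adds `bonus` to its score
    (hypothetical scheme, chosen only to keep the example small).
    """
    phrases = [p.lower().split() for p in context_phrases]

    def boosted(item):
        text, score = item
        tokens = text.lower().split()
        hits = sum(1 for p in phrases if contains_phrase(tokens, p))
        return score + bonus * hits

    return sorted(nbest, key=boosted, reverse=True)


if __name__ == "__main__":
    nbest = [("call caitlin now", -4.0), ("call kaitlyn now", -4.5)]
    print(rescore_with_context(nbest, ["Kaitlyn"]))
```

In this toy case the hypothesis containing the context word moves to the top once the bonus outweighs the original score gap; with an empty context list the original ranking is kept.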
ConEC: Earnings Call Dataset with Real-world Contexts for Benchmarking Contextual Speech Recognition

Huang, Ruizhe; Yarmohammadi, Mahsa; Trmal, Jan; Liu, Jing; Raj, Desh; Garcia, Leibny P; Ivanov, Alexei; Ehlen, Patrick; Yu, Mingzhi; Povey, Dan; et al (May 2024, ELRA and ICCL)

Knowing the particular context associated with a conversation can help improve the performance of an automatic speech recognition (ASR) system. For example, if we are provided with a list of in-context words or phrases, such as the speaker’s contacts or recent song playlists, during inference, we can bias the recognition process towards this list. There are many works addressing contextual ASR; however, there are few publicly available real benchmarks for evaluation, making it difficult to compare different solutions. To this end, we provide a corpus (“ConEC”) and baselines to evaluate contextual ASR approaches, grounded on real-world applications. The ConEC corpus is based on public-domain earnings calls (ECs) and associated supplementary materials, such as presentation slides and earnings news releases, as well as a list of meeting participants’ names and affiliations. We demonstrate that such real contexts are noisier than artificially synthesized contexts that contain the ground truth, yet they still leave considerable room for future improvement of contextual ASR technology.
Full Text Available
Building Keyword Search System from End-To-End ASR Systems

https://doi.org/10.1109/ICASSP49357.2023.10097249

Huang, Ruizhe; Wiesner, Matthew; Garcia-Perera, Leibny Paola; Povey, Dan; Trmal, Jan; Khudanpur, Sanjeev (June 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))

Full Text Available
